Abstract

A  critical  challenge  in  synthesis  techniques  for  iterative  applications  is  the  efficient  analysis  of  performance  in 
the  presence  of  communication  resource  contention.  To  address  this  challenge,  we  introduce  the  concept  of  the  period 
graph.  The  period  graph  is  constructed  from  the  output  of  a  simulation  of  the  system,  with  idle  states  included  in  the 
graph,  and  its  maximum  cycle  mean  is  used  to  estimate  overall  system  throughput.  As  an  example  of  the  utility  of  the 
period  graph,  we  demonstrate  its  use  in  a  joint  power/performance  optimization  solution  that  uses  either  a  nested 
genetic  algorithm,  or  a  simulated  annealing  algorithm.  We  analyze  the  fidelity  of  this  estimator,  and  quantify  the 
speedup  and  optimization  accuracy  obtained  compared  to  simulation. 

1  Introduction 

In many practical multiprocessor systems, there is contention for one or more shared communication resources.
One example of this is a shared bus. A processor must first gain access to the bus before it can execute an
interprocessor communication (IPC) operation. One consequence of this contention is that under self-timed, iterative
execution, there is no known method for deriving an analytical expression for the throughput of the system [19], and
thus, simulation is required to get a clear picture of application performance. However, simulation is computationally
very expensive, and it is highly undesirable to perform simulation inside the innermost optimization loop during
synthesis. To avoid such a simulation, an accurate and efficient estimator for throughput is required. This paper
presents an efficient estimator for the throughput of these systems. Our work is in the context of self-timed execution
of iterative dataflow specifications, which is an efficient and popular design methodology in the domain of digital
signal processing (DSP) [13]. An iterative dataflow specification consists of a dataflow representation of the body of
a loop that is to be iterated a large or indefinite number of times (e.g., across a vast stream of speech samples). In
self-timed execution, the assignment of tasks (dataflow graph nodes) to processors, and the execution ordering of tasks
on each processor are determined at compile-time, and at run-time, processors synchronize with one another only based on
inter-processor communication requirements, and do not necessarily synchronize at the end of each loop iteration.

In  this  paper,  we  assume  that  a  deterministic  protocol  is  used  to  arbitrate  contention  for  communication  resources.  We 
assume  that  a  schedule  has  already  been  computed  so  the  order  of  the  tasks  on  the  processors  is  known,  and  that  we  are 
adjusting  some  task  parameters  that  vary  the  task  execution  times  in  order  to  perform  an  optimization  of  the  system.  We 
assume  that  reasonably  accurate  estimates  are  available  for  the  task  execution  times,  and  for  the  variation  of  execution  times 
with parameter changes. Later in the paper, we specifically address the problem of finding an optimum set of supply
voltages for the processors in order to reduce power while satisfying a throughput constraint.

2  Previous  work 

The estimates for task execution times can be obtained through several methods. The most straightforward is for the
programmer to provide them while developing a library of primitive blocks, as is done in the Ptolemy system [3].
Analytical techniques also exist. Li and Malik [14] have proposed algorithms for estimating the execution time of
embedded software in an efficient manner. Much work has been done on scheduling and binding methods for high level
synthesis [16] [8] [9] [7]. These techniques attempt to optimize the schedule makespan, which is a suitable performance
metric for non-iterative applications or fully-static implementations, but is not ideally suited to the iterative,
self-timed context that we address in this paper. Supply voltage reduction has been used for some time in memories and
consumer electronics [15]. Chandrakasan et al. [5] [6] have presented a method based on reduced voltage level operation
combined with architectural-level parallelism, showing that the throughput can be maintained while reducing power.
Tiwari et al. [21] presented a technique for estimating the power given a set of software instructions. This technique
can be used in conjunction with the approaches proposed in this paper to obtain more accurate or automated estimates for
the power consumption of the tasks in the period graph model.

The period graph model is inspired by the synchronization graph model [19] for self-timed multiprocessor systems.
The synchronization graph has proven useful for a variety of techniques for minimizing synchronization overhead,
allocating interprocessor communication buffers, and scheduling communication operations [11, 19]. The period graph
concept differs from the synchronization graph model in that it explicitly models steady-state behavior under
communication resource contention, which is not accounted for in synchronization graphs.

A preliminary, partial summary of this paper has been published in [2].

3  Period  Graph 

If contention is resolved deterministically, and execution times are constant, then self-timed evolution may lead to an
initial transient state, but the execution will eventually become periodic. This holds because the multiprocessor may be
modeled as a finite-state system, and thus, aperiodic behavior (which implies the presence of infinitely many distinct
states) cannot hold. In DSP systems, although execution times are not always constant, or known precisely, they
typically adhere closely to their respective estimates with high frequency. Under such conditions, the periodic
execution pattern obtained from the estimated execution times provides an estimate of overall system throughput based on
the task-level estimates. Due to the largely deterministic nature of DSP applications, such system-level performance
analysis, and optimization based on task-level estimates, is common practice in the DSP design community [13].

For self-timed systems, when we apply execution time estimates to estimate overall throughput, it is necessary to
simulate (using the execution time estimates) past the transient state until a periodic execution pattern (steady state)
emerges. Unfortunately, the duration of the transient may be exponential in the size of the application specification
[19], and this makes simulation-intensive, iterative synthesis approaches highly unattractive.

The objective in this paper is to greatly reduce the rate at which simulation must be carried out during iterative
synthesis through the use of a novel period graph model. Given an assignment v of task execution times, and a self-timed
schedule, the associated period graph is constructed from the periodic, steady-state pattern of the resulting
simulation. The maximum cycle mean (MCM) of the period graph (with certain adjustments) is then used as a
computationally-efficient means of estimating the iteration period (the reciprocal of the throughput) as changes are
explored within a neighborhood of v. In this context, the MCM is the maximum over all directed cycles of the sum of the
task execution times divided by the sum of the edge delays. The MCM can be computed in low polynomial time [12].
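
As an illustration of the quantity being computed, the following Python sketch evaluates the maximum cycle mean of a
small graph by binary search on a candidate value, using Bellman-Ford style relaxation to test whether any cycle
exceeds it. This is one simple polynomial-time approach and is not necessarily the algorithm of [12]. The function
names, the edge-list representation, and the assumption that every directed cycle carries at least one unit of
(integer) delay are choices made for this example only.

```python
def has_positive_cycle(nodes, edges, exec_time, lam):
    """Return True if some directed cycle has sum(exec_time) - lam * sum(delay) > 0.

    edges is a list of (src, dst, delay); the weight of an edge is the execution
    time of its source node minus lam times its delay.
    """
    dist = {v: 0.0 for v in nodes}  # all-zero start acts like a virtual source
    for _ in range(len(nodes)):
        changed = False
        for u, v, delay in edges:
            w = exec_time[u] - lam * delay
            if dist[u] + w > dist[v] + 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return False  # longest paths converged: no positive cycle
    return True  # still improving after |V| passes: a positive cycle exists


def max_cycle_mean(nodes, edges, exec_time, tol=1e-9):
    """Binary search for the largest ratio (sum of times) / (sum of delays)."""
    # Assumption: each cycle has delay >= 1, so the total time bounds the ratio.
    lo, hi = 0.0, sum(exec_time.values()) + 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if has_positive_cycle(nodes, edges, exec_time, mid):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For a two-node cycle with execution times 3 and 2 and a single unit of delay, the sketch returns a value close to 5,
which is the iteration period that cycle imposes.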

The first step in the construction of the period graph is the identification of the period from the simulator output.
This can be performed by tracing backward through the simulation and searching for the latest intermediate time instant
$t_a$ at which the system state $S(t_a)$ equals the state $S(t_f)$ obtained at the end of the simulation (here, $t_f$
denotes the simulation time limit). If no match is found, then the end of the first period exceeds $t_f$, and thus, the
simulation needs to be extended beyond $t_f$. Otherwise, the region (Gantt chart) that spans the interval $[t_a, t_f]$
constitutes a (minimal) period of the simulated steady state.
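
The backward search described above can be sketched in a few lines of Python. The sketch assumes, purely for
illustration, that the simulator emits one hashable state snapshot per unit time step; the names `find_period`,
`states`, `t_a`, and `t_f` are hypothetical and do not come from any particular simulator.

```python
def find_period(states):
    """Search backward for the latest earlier instant whose state matches the final one.

    states[t] is a hashable snapshot of the system state at time step t, and the
    last entry is the state at the simulation time limit t_f.
    Returns (t_a, t_f), or None if the simulation must be extended.
    """
    t_f = len(states) - 1
    for t_a in range(t_f - 1, -1, -1):  # trace backward from the end
        if states[t_a] == states[t_f]:
            return t_a, t_f
    return None
```

Because the search starts at the end and stops at the first match, the interval it returns is the shortest one that
ends at the time limit, which corresponds to a minimal period.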

Here, the system state S(t) contains the execution state of each processor, which is either “idle” or representable by
an ordered pair (A, x), where A is the task being executed at time t, and x denotes the time remaining until the current
invocation of A is completed. The state S(t) also contains the current buffer sizes of all IPC buffers, as well as any
information (e.g., request queue status) that is used by the protocol for resolution of communication contention. Our
approach to efficiently determining this period is as follows:

• Perform a simulation of the schedule for some time $T_{sim}$. Define a constant $C$, which is an initial estimate for
the number of complete cycles (invocations) of the graph that must be simulated in order to find a period. This constant
represents the length of the initial transient, before the output becomes periodic. If this initial estimate is too low,
it will be increased during the algorithm. Let $N$ be the number of processors, and let $n_j$ be the number of tasks
scheduled on processor $j$, where $j \in [1, N]$. Tasks include IPC tasks as well as computational tasks. Label these
tasks $V_1, V_2, \ldots, V_{n_j}$. We consider the case where the system executes these tasks infinitely. The invocation
number of a task is defined as the number of times a given task has executed, and is denoted with a superscript. For
example, $V_a^b(j)$ denotes the $b$th invocation of task $a$ on processor $j$. Define a simulation array $Sim_j[i]$ for
each processor, where $i \in [1, M_j]$ and $M_j$ is the number of tasks on processor $j$ that were output by the
simulator. The elements of the simulation array are the tasks, and are ordered by reverse start time, so that
$Start(Sim_j[i]) > Start(Sim_j[i+1])$.


• Create two idle vectors of length $n_j$ for each processor, each spanning one invocation. Label the first idle vector
$Idle1_j[k]$, where $k \in [1, n_j]$. Label the second idle vector $Idle2_j[k]$.

• Examine the IPC buffer vector at some fixed point of each idle vector. The IPC buffer vector consists of the numbers of
tokens queued on all the IPC edges of the graph enumerated in some order. The IPC buffer vector must be output by the
simulator at least once every graph iteration. For example, the simulator could output an IPC buffer vector for each
processor every time the processor executes the first task scheduled on it. In this way, each idle vector would be
associated with one IPC buffer vector. Label these vectors $IPCBuf1_j[q]$ and $IPCBuf2_j[q]$, where $q \in [1, E]$ and
$E$ is the number of edges in the IPC graph. The IPC buffer vector represents the state of the communication buffers in
the system. Let $Tokens(e, t)$ be the number of data tokens on edge $e$ at time $t$. Let $TaskNum_j(t)$ be the number of
the node that is executing on processor $j$ at time $t$. Pseudo-code for constructing the period is shown in Figure 2.


Our experience suggests that in practice, most graphs have periods spanning only a few invocations, so the above
procedure for finding the period is efficient. For a system with a period that spans N invocations and with at most L
tasks per processor, this method requires LN(N + 1) comparisons.
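
The span search at the heart of this procedure can be sketched as follows. This is a simplified Python illustration
rather than a transcription of Figure 2: it assumes the idle gaps and IPC buffer vectors have already been extracted
from the simulator output and are stored most recent first, and the names `find_span`, `idle`, `n`, and `ipc_buf` are
hypothetical.

```python
def find_span(idle, n, ipc_buf):
    """Find the smallest number of invocations after which the trace repeats.

    idle[j]    : idle gaps on processor j, most recent first
    n[j]       : number of tasks per graph invocation on processor j
    ipc_buf[s] : IPC buffer vector recorded s invocations before the end
    Returns the span, or None when the trace is too short (the caller would
    then increase C and simulate for longer).
    """
    span = 0
    while True:
        span += 1
        too_short = span >= len(ipc_buf) or any(
            2 * span * n[j] > len(idle[j]) for j in range(len(n))
        )
        if too_short:
            return None
        # Compare the latest `span` invocations with the `span` before them.
        same_idle = all(
            idle[j][: span * n[j]] == idle[j][span * n[j] : 2 * span * n[j]]
            for j in range(len(n))
        )
        if same_idle and ipc_buf[0] == ipc_buf[span]:
            return span
```

Each candidate span costs one pass over at most span times L idle entries per processor, which is why short periods
make the search inexpensive.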

Figure 1. An illustration of the period graph model.

Procedure CalculatePeriod
    estimate initial integer C and initial simulation time T_sim
    int minc = 0
    while (minc < C)
        increment T_sim
        simulate for T_sim
        minc = minimum over all j of the number of complete invocations simulated on processor j
    endwhile
    t = 0
    for (t = 0; t < T_sim; t++)
        for (j = 1...N)
            a_j = TaskNum_j(t)
            invocation_j[a_j]++
            b_j = invocation_j[a_j]
            if (TaskNum_j(t) > TaskNum_j(t - 1))
                Sim_j[i] = V_a^b(j)
            endif
        endfor
    endfor
    span = 0
    repeat
        span++
        for (k = 1 ... span*n_j)
            for (j = 1...N)
                if (span*n_j > M_j)
                    error("Increase C and start over")
                endif
                Idle1_j[k] = Finish(Sim_j[k]) - Start(Sim_j[k + 1])
                Idle2_j[k] = Finish(Sim_j[span*n_j + k]) - Start(Sim_j[span*n_j + 1 + k])
            endfor
            for (q = 1...E)
                IPCBuf1[q] = Tokens(q, Start(Sim_1[1]))
                IPCBuf2[q] = Tokens(q, Start(Sim_1[span*n_1 + 1]))
            endfor
        endfor
    until (Idle1_j[k] = Idle2_j[k] for all j and k) and (IPCBuf1 = IPCBuf2)

Figure 2. Pseudocode specification for period calculation.

Figure 1(a) illustrates an application graph (a dataflow specification of an application) along with a self-timed
schedule; Figure 1(c) shows the periodic steady state that results from the schedule of Figure 1(a) and the execution
time estimates shown in Figure 1(b); and Figure 1(d) shows the resulting period graph. The nodes in Figure 1(d) that
contain diagonal stripes correspond to idle time ranges in the period, and solid black circles on edges represent
delays, which model inter-iteration dependencies. Note that the steady state period may span multiple graph iterations
(2 in this example), and in the period graph, this translates to multiple instances of each application graph task.

For clarity in this illustration, we have assumed negligible latency associated with IPC. As described below,
non-negligible IPC costs can easily be accommodated in the period graph model by introducing send and receive tasks at
appropriate points.

As  illustrated  in  Figure  1,  the  period  graph  consists  of  all  the  tasks  comprising  the  period  that  was  detected,  with  the 
idle  time  ranges  between  tasks  (including  those  that  are  caused  by  communication  contention)  also  treated  as  nodes  in  the 
graph.  The  nodes  are  connected  by  edges  in  the  order  that  they  appear  in  the  period.  An  edge  is  placed  from  the  last  node  in 
the  period  for  each  processor  to  the  first  node  in  the  period.  This  edge  is  given  a  delay  value  of  one  (to  model  the  associated 
transition  between  period  iterations),  while  all  of  the  other  intraprocessor  edges  have  delay  values  of  zero.  This  is  done  for 
all  the  processors  in  the  system.  Our  model  utilizes  send  and  receive  nodes  for  IPC.  For  each  IPC  point,  a  send  node  is 
placed  on  the  processor  that  is  sending  data,  and  a  corresponding  receive  node  is  placed  on  the  processor  that  will  receive  the 
data.  The  period  graph  is  completed  by  adding  an  edge  from  each  send  node  to  its  corresponding  receive  node. 
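
The construction rules above can be sketched in Python as follows. The data layout (a per-processor list of named
nodes with durations, idle ranges included) and the function name `build_period_graph` are assumptions made for this
example. The text does not state the delay carried by a send-to-receive edge, so the sketch leaves it as a
caller-supplied value.

```python
def build_period_graph(period, ipc_pairs):
    """Build node weights and delay-labelled edges for one detected period.

    period[j]  : ordered list of (node_name, exec_time) on processor j over the
                 period, with idle ranges included as nodes of their own
    ipc_pairs  : list of (send_node, receive_node, delay); the delay value is an
                 assumption supplied by the caller
    Returns (exec_time, edges) where edges is a list of (src, dst, delay).
    """
    exec_time, edges = {}, []
    for seq in period.values():
        names = [name for name, _ in seq]
        for name, t in seq:
            exec_time[name] = t
        # Zero-delay edges follow the order of appearance within the period.
        for a, b in zip(names, names[1:]):
            edges.append((a, b, 0))
        # Wrap-around edge with unit delay models the next period iteration.
        edges.append((names[-1], names[0], 1))
    for send, recv, delay in ipc_pairs:
        edges.append((send, recv, delay))
    return exec_time, edges
```

The returned node weights and edge list are in a form that a maximum cycle mean routine can consume directly.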


4  Fidelity  of  the  estimator 

We  calculate  the  fidelity  of  the  period  graph  estimator  as  the  task  execution  times  are  varied.  Here,  we  use  the  example 
of  varying  the  processor  voltages  in  order  to  change  the  task  execution  times.  When  the  voltage  on  a  processor  is  varied,  the 
execution  time  of  a  computational  task  varies  according  to 


$$t_{delay} = k \frac{V_{dd}}{(V_{dd} - V_t)^2}, \qquad (1)$$

where $V_{dd}$ is the supply voltage, $V_t$ is the threshold voltage, and $k$ is a constant [6]. We use a value of 0.8
volts for the threshold voltage. The execution time $pe_i$ of each of these states in the original (non-scaled) period
graph is referenced to a voltage $V_{ref}$. The change in execution time of each computation node is found by taking the
derivative:


$$\Delta pe_i = pe_i \left[ \frac{V_{sc}}{V_{ref}} \left( \frac{V_{ref} - V_t}{V_{sc} - V_t} \right)^2 - 1 \right], \qquad (2)$$


where $V_{sc}$ is the new voltage. It is not obvious, however, how one should adjust the idle times in the period graph.
We separate the idle nodes into two sets: contention idles and data idles. When a node has the necessary data to execute
(the necessary data has already been produced), but is idle waiting for access to the bus, the associated idle node is
classified as a contention idle. When a node is idle waiting for its predecessors’ data, the associated idle node is
classified as a data idle. By experimenting with a large number of application graphs, we found that we could capture
the effects of contention and obtain the best fidelity by zeroing out the data idles and leaving the contention idles
constant as the computation nodes are scaled. Using these rules, the fidelity is calculated as follows:


• Given an application graph, construct a valid schedule. We used the dynamic level scheduling algorithm given in [17].
Next, construct the period graph as discussed earlier. Generate N voltage vectors (assignments of voltages to the
processors in the target architecture). For each voltage vector, perform a simulation to determine the throughput, with
the execution times of the tasks on each processor given by (1) according to the voltage on the processor. Also, obtain
an estimate for the throughput by calculating the MCM of the voltage-scaled period graph, in which the execution times
of the computation nodes are given by (1), and the execution times of the idle nodes are as explained above.

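• Calculate the fidelity according to:

$$\text{Fidelity} = \frac{2}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} f_{ij}, \qquad (3)$$

where

$$f_{ij} = \begin{cases} 1 & \text{if } \operatorname{sign}(S_i - S_j) = \operatorname{sign}(M_i - M_j) \\ 0 & \text{otherwise} \end{cases}; \qquad (4)$$

$$\operatorname{sign}(x) = \begin{cases} -1 & \text{if } x < 0 \\ 0 & \text{if } x = 0 \\ 1 & \text{if } x > 0 \end{cases}; \qquad (5)$$

the $S_i$'s denote the simulated throughput values; and the $M_i$'s are the corresponding estimates from the period
graph.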
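
The idle-node rules and the pairwise fidelity measure can be sketched in Python as follows. The sketch assumes that
the pairwise agreement count is normalized by the number of pairs, so that a value of one means every pairwise ordering
agrees. The node representation and the names `scale_exec_times` and `fidelity` are hypothetical, and node kinds other
than computation and data idle (for example send and receive nodes) are left unscaled as a simplifying assumption.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def scale_exec_times(nodes, scale):
    """Apply the scaling rules to a period graph's node weights.

    nodes[name] = (kind, time, processor), with kind one of
    'compute', 'contention_idle', 'data_idle' (other kinds are kept unchanged).
    scale[p] is the execution-time factor for processor p.
    """
    scaled = {}
    for name, (kind, t, proc) in nodes.items():
        if kind == "compute":
            scaled[name] = t * scale[proc]   # computation scales with voltage
        elif kind == "data_idle":
            scaled[name] = 0.0               # data idles are zeroed out
        else:
            scaled[name] = t                 # contention idles stay constant
    return scaled


def fidelity(S, M):
    """Fraction of pairs (i, j) whose ordering agrees between S and M."""
    n = len(S)
    agree = sum(
        1
        for i in range(n - 1)
        for j in range(i + 1, n)
        if sign(S[i] - S[j]) == sign(M[i] - M[j])
    )
    return 2.0 * agree / (n * (n - 1))
```

Fidelity only rewards correct ranking of candidate solutions, which is what a search procedure needs; it says nothing
about the absolute error of each estimate.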

Figure 3 plots the fidelity for a six-processor system in which the voltage on the individual processors can vary
between plus or minus five percent. The x-axis represents the sum of the absolute values of the voltage changes over all
processors. Each point on the graph is a fidelity calculation for N = 100 voltage vectors. A value of one is a “perfect”
fidelity. It can be seen that in the range shown, the fidelity is always greater than 0.65. It is also important that
the estimator have a small error at each point. Figure 4 plots

$$\sum_{i=1}^{N} \left[ (S_i - M_i) / S_i \right]. \qquad (6)$$

It can be seen that the error increases as the voltage vector moves away from the reference point, and that the estimate
is slightly biased. For the range shown in the graphs, where each processor voltage is changed by a maximum of fifteen
percent, the error is less than four percent.

5  Using  the  Period  Graph  in  a  Joint  Power/Performance  Algorithm 

An effective way to reduce power consumption of a processor core in CMOS technology is to lower the supply voltage
level, which exploits the quadratic dependence of power on voltage [6]. Reducing the supply voltage also has the effect
of decreasing the clock speed and increasing circuit delay. The circuit delay can be modeled by (1). The power
consumption is given by

$$P = \alpha C_L V_{dd}^2 f, \qquad (7)$$

where $f$ is the clock frequency, $C_L$ is the load capacitance, and $\alpha$ is the switching activity [6]. To
accommodate the possibility of putting processors in states of lower switching activity during idle periods, our model
includes a parameter $\alpha_{idle}$ for the idle states, and a parameter $\alpha_{nonidle}$ for the computational
tasks, where $\alpha_{idle} < \alpha_{nonidle}$. A more detailed power analysis could assign a different $\alpha$ for
each computational task if that data were available. A different power optimization technique, which can be used in
conjunction with the voltage scaling technique presented here, utilizes a nearly complete processor shutdown during the
idle periods [10] [20]. In our model, this would correspond to $\alpha_{idle} = 0$. Our model for the power is the
average energy consumption per graph iteration period. This corresponds in a typical DSP system to the average energy
required to process one sample. Here, the energy of each node equals its power times its execution time.
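
A small Python sketch of this energy model follows. It assumes the delay law in which delay is proportional to
$V_{dd}/(V_{dd} - V_t)^2$ and the power law $P = \alpha C_L V_{dd}^2 f$. The function names and the flat list of
(switching activity, execution time) pairs are illustrative choices, and the clock frequency is passed in explicitly
rather than derived from the voltage.

```python
def delay_scale(v_sc, v_ref, v_t=0.8):
    """Ratio of circuit delay at v_sc to the delay at v_ref.

    Assumes delay = k * V / (V - Vt)^2, so the constant k cancels in the ratio.
    """
    return (v_sc / v_ref) * ((v_ref - v_t) / (v_sc - v_t)) ** 2


def energy_per_period(nodes, c_load, v_dd, freq):
    """Sum of power times execution time over all nodes in one period.

    nodes is an iterable of (alpha, exec_time) pairs; idle nodes are included
    with their own (lower) switching activity.
    """
    return sum(alpha * c_load * v_dd ** 2 * freq * t for alpha, t in nodes)
```

Lowering the voltage from 5 V to 4 V with a 0.8 V threshold gives a delay ratio above one, so tasks on that processor
take longer even as the quadratic voltage term reduces their power.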

In  a  system  consisting  of  multiple  processors,  one  has  the  ability  to  choose,  within  a  certain  range,  the  (fixed) 
operating  voltage  on  each  processor.  This  opens  up  an  additional  degree  of  freedom  that  can  be  exploited  to  minimize 
the  system  power  consumption.  By  choosing  a  lower  voltage  of  a  processor  that  is  executing  tasks  that  are  not  on  the 
critical  path,  the  throughput  can  remain  unchanged  while  the  overall  power  consumption  is  reduced.  In  general,  a 
combination of raising voltages on some processors while lowering others can yield the most attractive
power/performance solution.

When applying voltage scaling to a multiprocessor system, the valid solution space is typically much too large
to search by brute-force methods. In addition, since there is no general analytical formula for calculating the
throughput of these systems in the presence of communication resource contention, each candidate solution must either
be simulated or estimated using some heuristic.

6  Genetic  algorithm  formulation 

To demonstrate the general utility of the period graph based performance estimation approach, we incorporated
it into two significantly different probabilistic search techniques to derive two different algorithms for systematic
voltage scaling. The first algorithm presented utilizes the framework of genetic algorithms (GAs) [1]. The specific
GA explored here consists of an inner GA nested within an outer GA. The inner GA performs a local search around a
point from the population of the outer GA, using the MCM of the period graph in its objective function as an estimate
for the throughput. A period constraint $T_{constraint}$ is given as an input to the optimization problem, where the
period is the reciprocal of the throughput. The objective function calculates the power consumption associated with each
solution by calculating the total energy per period, as discussed earlier. If the period associated with a solution
violates the period constraint ($T_{solution} > T_{constraint}$), the power consumption is multiplied by a large penalty
factor $\exp(100(T_{solution} - T_{constraint}))$. The GA attempts to minimize this objective function.
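
The penalized objective can be written compactly. The following Python sketch assumes the power and period of a
candidate have already been computed elsewhere; the function and parameter names are illustrative.

```python
import math


def objective(power, t_solution, t_constraint, penalty_rate=100.0):
    """Power, inflated exponentially when the period constraint is violated."""
    if t_solution > t_constraint:
        return power * math.exp(penalty_rate * (t_solution - t_constraint))
    return power
```

Because the penalty grows smoothly from one at the constraint boundary, slightly infeasible solutions remain
comparable to feasible ones, while large violations are effectively ruled out.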

In the outer loop, a population of $N_{outer}$ voltage vectors is generated. A simulation is run and a period graph
constructed for each of these outer loop voltage vectors. For each of the outer loop voltage vectors, a new inner loop
population is generated such that $|V_{outer,i} - V_{inner,i}| < \epsilon$ for $i \in [1, N_{proc}]$, where $N_{proc}$
is the number of processors, $V_{outer,i}$ is the voltage on processor $i$ in the outer population, $V_{inner,i}$ is the
voltage on processor $i$ in the inner population, and $\epsilon$ is a user-defined threshold. The inner population size
is $N_{inner}$. The inner GA then performs a local search using this population for a number of generations
$Generations_{inner}$ in an attempt to find a locally optimal voltage vector. The inner GA uses the MCM of the period
graph in its objective function. After an invocation of the inner GA is finished, one simulation is performed using the
resulting voltage vector, and the actual throughput for this point is used to compute its fitness. The outer loop
voltage vector is then replaced with this locally-optimized voltage vector for use in the next outer loop generation.
The outer loop is run for a number of generations $Generations_{outer}$.
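
The role of the inner search can be illustrated with a deliberately simplified Python sketch. Plain random sampling
stands in for the inner GA here, since the text does not specify the crossover and mutation operators; the function
name `refine_outer_vector` and the `estimate_cost` callback (which would wrap the period-graph MCM estimate and the
penalized objective) are hypothetical.

```python
import random


def refine_outer_vector(v_outer, eps, n_inner, estimate_cost, rng):
    """Return the cheapest of n_inner sampled vectors within eps of v_outer.

    Random sampling is a stand-in for the inner GA. The outer vector itself is
    kept as a candidate, so the result is never worse under estimate_cost.
    """
    candidates = [list(v_outer)]
    for _ in range(n_inner):
        # Each processor voltage is perturbed by at most eps.
        candidates.append([v + rng.uniform(-eps, eps) for v in v_outer])
    return min(candidates, key=estimate_cost)
```

In the scheme described above, the returned vector would then be simulated once to obtain its true throughput before
it replaces the outer loop vector.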


7  Simulated  annealing  algorithm 


Simulated annealing is another well-known method for searching large design spaces. Using a standard simulated
annealing package [4], we have implemented an alternative version of period-graph-based voltage scaling optimization.
The objective function here is the same as for the genetic algorithm. The system is first simulated with an initial
voltage vector $V_i = LSV_i$, and the period graph is built. In order to ensure that the period graph will be a good
enough estimator, a resimulation threshold $T$ is maintained. The difference between the current input $CV_i$ to the
objective function, and the voltage vector $LSV_i$ corresponding to the simulation used to compute the current period
graph, is calculated. If

$$\frac{1}{N} \sum_{i=1}^{N} \frac{|CV_i - LSV_i|}{LSV_i} > T, \qquad (8)$$

the graph is resimulated using $CV_i$. The period graph is rebuilt, and $CV_i \rightarrow LSV_i$. For $T = 0$, the graph
will be resimulated every time, and the period graph will offer no speed advantage. The larger the value of $T$, the
less often the graph will be resimulated, and the faster the optimization algorithm will perform. However, when $T$ is
too large, the fidelity of the period graph estimate will be unacceptably low and the quality of the final result will
suffer. Based on our experiments with a number of graphs, the optimal value of $T$ is highly application-dependent, but
a value of $T = 0.1$ (10%) generally gives good results.
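
The resimulation test can be sketched in Python as follows, assuming the criterion is the mean relative voltage
change across processors compared against the threshold. The function and argument names are illustrative.

```python
def needs_resimulation(cv, lsv, threshold):
    """True when the mean relative change from the last simulated vector exceeds the threshold.

    cv  : current voltage vector passed to the objective function
    lsv : voltage vector used for the most recent simulation
    """
    mean_rel_change = sum(abs(c - l) / l for c, l in zip(cv, lsv)) / len(cv)
    return mean_rel_change > threshold
```

When the test fires, the caller would resimulate at `cv`, rebuild the period graph, and record `cv` as the new
reference vector.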


8  Results 

Figure 5 shows an example of the reduction in power resulting from the genetic optimization algorithm on the
FFT3 application graph (Figure 10). The parameters of the GA were $N_{outer} = N_{inner} = 50$,
$Generations_{outer} = 10$, and $Generations_{inner} = 20$. The local search voltages were constrained to be within five
percent of the corresponding outer loop voltages. The period constraint was calculated by simulating the system with all
six processors operating at voltage $V_{ref}$. For this example, the system power consumption was reduced by 43%, while
maintaining the original throughput. To evaluate the advantage of the period graph approach over using brute-force
simulation, a second nested GA was implemented. This algorithm was identical to the algorithm discussed above, except
that the inner loop did not use the period graph estimate for the throughput. Instead, each voltage vector was evaluated
by simulation. This algorithm consumed 26 times more CPU time, and produced similar results, as shown in Figure 6.

Figure 5. Plot of (optimized power)/(initial power) vs. genetic algorithm iteration using the period graph estimator
(FFT3, fixed throughput constraint).

Figure 6. Plot of (optimized power)/(initial power) vs. genetic algorithm iteration using simulation only (FFT3, fixed
throughput constraint).

Figure 7 summarizes the power reduction results for the simulated annealing algorithm applied to a fast Fourier
transform (FFT3) application graph, for different values of the resimulation threshold $T$. It can be seen that as $T$
is increased, the algorithm progresses more quickly. The simulated annealing algorithm begins with a ‘melting’ routine,
where the temperature is increased until a phase change is detected. The initial flat part of the curves corresponds to
the time spent in the melting routine. We have found that for values of $T$ above 20%, the period graph is not a good
enough estimator and the algorithm does not converge.

Figure 7. Plot of (optimized power)/(initial power) vs. time for the simulated annealing algorithm on the FFT3
application.

Application (nodes)    …       …       …       T = 0.1   T = 0.25
…                      …       0.95    0.65    0.6       1
fft2 (28)              0.97    …       0.71    …         1
…                      …       0.77    0.59    0.59      …
mus (20)               0.89    0.71    0.67    0.82      1
meas (12)              0.77    0.73    0.81    0.82      1
qmf (14)               0.84    0.65    0.67    0.73      1
rand1 (30)             0.91    0.77    0.53    0.65      1
rand2 (100)            1       0.85    0.77    0.73      1
rand3 (200)            1       1       1       0.94      1

Table 1. Power reduction for fixed computation time.

Table 1 summarizes the power reduction for the simulated annealing algorithm for several additional applications using
different values of the resimulation threshold. At the start of the optimization, all processor voltages were set at 5
volts. The throughput at this point was used as the throughput constraint. In the table, the first three rows correspond
to three different FFT implementations, mus refers to a music synthesis algorithm, qmf refers to a quadrature mirror
filter bank, meas is a measurement application, and the last three rows correspond to graphs that were generated using
Sih’s algorithm for randomly generating application graphs [18]. The numbers in parentheses give the numbers of nodes in
these applications. The optimization was performed for a fixed time of 30 minutes in each case. The optimum resimulation
threshold was between 2% and 10% in all cases. For $T = 0.25$, the period graph is not a good estimator and none of the
results returned during the optimization algorithm satisfied the throughput constraint. For the largest graph, the fixed
simulation time was not long enough to make much improvement, but the best result occurred for $T = 0.1$, where the
simulations are less frequent. Table 2 summarizes the power reduction for the genetic algorithm with and without using
the period graph, with a fixed compile time of one hour.

9  Conclusion 

This paper has explored a period graph model that enables efficient voltage scaling optimization for self-timed
implementations of iterative applications. The period graph can be used as a computationally efficient estimator for
the throughput in multiprocessor systems in which communication contention renders exact analysis too time-consuming.
This model is especially useful in iterative synthesis techniques, such as those based on probabilistic search.
Our paper has demonstrated effective voltage scaling techniques based on incorporating the period graph into genetic
algorithm and simulated annealing formulations. Other optimizations, such as exploiting memory/speed trade-offs of
the individual tasks, are also possible. These may be more appropriate to the genetic algorithm and simulated annealing
framework, as a larger set of independent moves is available during optimization. Other useful directions for further
work include integrating the period graph model into the scheduling phase, rather than restricting its use to
voltage scaling of fixed schedules, and the investigation of adaptive methods for dynamically adjusting the frequency
of resimulation. The application graphs are shown in Figures 8, 9, 10, 11, 12, 13, 14, and 15.